# **CHAPTER 1**

# INTRODUCTION

Floating-point arithmetic is fundamental to modern computing, enabling precise calculations across a wide range of values. Its applications span scientific computing, graphics processing, and machine learning. Unlike fixed-point arithmetic, floating-point representation allows dynamic allocation of bits between the integer and fractional parts, providing a balance between range and precision. Governed by the IEEE 754 standard, floating-point numbers are represented in three components: the sign (indicating positivity or negativity), the exponent (showing scale), and the mantissa or significand (defining precision). Floating-point multiplication involves multiplying mantissas, adjusting exponents, and determining the result's sign. Efficient implementation of these multiplications is crucial for enhancing computational speed and minimizing power consumption.

This project focuses on designing a high-performance floating-point multiplier using advanced hardware components, implemented and verified using Cadence Virtuoso and Xilinx Vivado tools. The design integrates high-performance adders, such as the Kogge Stone Adder and Carry Bypass Adder, with advanced multipliers like the Wallace Tree Multiplier and Systolic Multiplier, ensuring low delay, high throughput, and efficient power usage. Cadence Virtuoso, with its precise transistor-level design capabilities, is utilized for developing and simulating the core hardware components, ensuring accurate timing and power analysis. Xilinx Vivado supports the implementation of the design on FPGA platforms, allowing for functional validation and performance testing in real-world scenarios.

The Kogge-Stone Adder is incorporated for its low propagation delay, achieved through a parallel prefix structure that divides carry propagation into logarithmic stages, drastically reducing critical path delays. This adder is particularly effective for high-performance applications requiring speed, such as arithmetic logic units (ALUs) and floating-point units, and its scalability for large bit-width operations makes it a critical choice for the design. Complementing this, the Carry Bypass Adder balances speed and complexity, leveraging bypass logic to accelerate carry propagation. Its reduced power consumption and lower complexity make it suitable for systems with moderate bit-width requirements or power constraints.

# 1.1 Problem Description

Floating-point multipliers are critical for applications like machine learning, signal processing, and scientific simulations, requiring precise and efficient computations. Traditional designs face challenges such as high latency, power consumption, and scalability issues, limiting their suitability for modern demands. Optimizing critical operations like mantissa multiplication and exponent addition is essential for improved performance. This project integrates advanced components, including the Kogge-Stone Adder, Carry Bypass Adder, Wallace Tree Multiplier, and Systolic Multiplier, to address these challenges. The goal is to develop a high-speed, power-efficient multiplier compliant with the IEEE 754 standard.

# 1.2 Motivation

The rapid advancement of technology in fields such as artificial intelligence, data analytics, and digital signal processing has significantly increased the need for high-speed, energy efficient computation. Floating-point arithmetic, being a cornerstone of these computations, requires robust hardware designs to handle the growing complexity and precision demands. Existing solutions often fail to balance speed, power consumption, and scalability, creating a bottleneck in overall system performance.

This project is motivated by the need to overcome these limitations through innovative hardware design. By leveraging the strengths of high-performance adders and multipliers, the proposed floating-point multiplier aims to achieve superior speed, efficiency, and reliability. Its implementation will cater to the growing requirements of cutting-edge technologies, contributing to advancements in real-time systems, scientific research, and computational hardware design.

# 1.3 Objectives

#### Develop a High-Performance Floating-Point Multiplier

Designing a high-performance floating-point multiplier requires incorporating advanced components to achieve superior speed and efficiency. By utilizing the Kogge Stone Adder, which excels in fast carry computation through its parallel prefix structure, and the Wallace Tree Multiplier, which reduces partial products with minimal delay, the multiplier achieves low latency and high computational throughput. These structures

work in harmony to optimize the multiplication process, making the design suitable for high-speed computing tasks. This focus on performance ensures that the multiplier can handle intensive workloads while maintaining accuracy and reliability.

# • Optimize Power and Resource Efficiency

Power consumption and hardware complexity are critical considerations in any modern design. The inclusion of efficient components like the Carry Bypass Adder helps reduce power consumption by enabling faster carry propagation with minimal switching activity. Additionally, the Systolic Multiplier, known for its streamlined data flow, significantly reduces hardware resource usage by arranging processing elements in a compact and efficient manner. This combination of techniques ensures the design is not only high-performing but also power-efficient and resource optimized, making it viable for systems with strict energy and area constraints.

# • Enhance Scalability and Adaptability

Scalability and adaptability are essential for making the multiplier versatile and futureproof. By designing the multiplier to support varying bit-widths and precision levels, it can address the needs of diverse applications, from high-precision scientific simulations to resource-constrained embedded systems. This flexibility allows the multiplier to adapt to evolving requirements, making it a practical solution for a wide range of use cases. Whether for 16-bit precision in compact devices or extended precision for critical computations, the design ensures optimal performance across all configurations.

#### • Cater to Real-Time Applications

Real-time systems demand high throughput and minimal delay, and this design caters to such needs with a focus on achieving low latency and high computational speed. Applications like machine learning accelerators, digital signal processors (DSPs), and high-speed communication systems require rapid processing of large datasets without compromising accuracy. By integrating advanced arithmetic structures and optimizing for parallelism, the multiplier is well-suited to handle the rigorous demands of real time environments, ensuring smooth operation and enhanced overall system performance.

# 1.4 Existing System

In traditional floating-point multipliers, basic arithmetic components like ripple-carry adders and array multipliers are commonly used for operations like exponent addition and mantissa multiplication. While these designs are straightforward, they suffer from high latency and limited scalability, making them unsuitable for modern high-performance applications.

#### 1.4.1. Ripple-Carry Adders

Ripple-Carry Adders (RCAs) are one of the most straightforward types of binary adders, designed to compute the sum of two binary numbers. Their simplicity lies in the sequential propagation of the carry signal through all the bits, starting from the least significant bit (LSB) to the most significant bit (MSB). While this makes RCAs easy to implement and resource-efficient for small bit-width operations, their performance suffers in larger designs. The main drawback is the carry propagation delay, which increases linearly with the number of bits. For example, in a 64-bit adder, the carry generated at the LSB must propagate through all 64 stages before the MSB's final sum can be determined. This characteristic significantly impacts the performance of systems that require high-speed computations, making RCAs unsuitable for latency-critical applications such as floating-point arithmetic or digital signal processing.

# 1.4.2. Array Multipliers

Array Multipliers are conventional structures used for binary multiplication. They operate by generating all partial products simultaneously and then summing them using a network of adders arranged in a rectangular array. While this approach is straightforward and ensures consistent performance across various bit-widths, it comes with inherent limitations. The critical path delay in array multipliers is relatively long because the summation of partial products occurs sequentially, layer by layer. Furthermore, the large number of adders and interconnections required leads to high power consumption and increased silicon area. As the operand size grows, these issues become more pronounced, making array multipliers less efficient for applications requiring low-power or high-performance solutions. In systems dealing with large bit width operations, the conventional array multiplier often becomes a bottleneck in terms of speed and energy efficiency.

# 1.4.3. Sequential Floating-Point Multipliers

Sequential Floating-Point Multipliers are common in systems that prioritize simpler control logic over raw speed. These multipliers process each stage of the floating-point operation such as normalization, rounding, and result adjustment—in sequence rather than in parallel. While this reduces the overall complexity of the design, it significantly impacts performance. Sequential processing introduces delays because each stage must complete before the next can begin, leading to a longer computation time for the final result. This approach limits the throughput of the multiplier, making it less suitable for applications requiring rapid computations, such as machine learning, real-time signal processing, or high-frequency trading. In addition, the iterative nature of sequential multipliers can lead to increased power consumption and thermal issues when deployed in systems with high computational demands. Consequently, while sequential floating-point multipliers are simpler to implement, their performance drawbacks often make them a suboptimal choice for modern, high-speed applications.

# **CHAPTER-2**

# LITERATURE SURVEY

| Sl.<br>No | Name of the<br>Journal                                       | Title                                                                                    | Vol. No.<br>& Issue.<br>No    | Month &<br>Year  | Interna<br>tion<br>al /<br>Nati<br>onal |
|-----------|--------------------------------------------------------------|------------------------------------------------------------------------------------------|-------------------------------|------------------|-----------------------------------------|
| 1         | International Journal of Computer Techniques                 | High-Performance<br>Wallace Tree<br>Multiplier                                           | Volume<br>7, Issue<br>No 01   | Feb 2020         | International                           |
| 2         | International<br>Journal of<br>Electronics<br>Letters        | Modified Low<br>Power<br>Wallace Tree<br>Multiplier using<br>Higher Order<br>Compressors | Volume<br>5, Issue<br>No 2    | Jan 2017         | International                           |
| 3         | Computer Science and Information Technologies                | Design and testing of systolic array multiplier using fault injecting schemes            | Volume<br>3,<br>Issue No<br>1 | Mar 2022         | International                           |
| 4         | IEEE Transactions on Computers                               | High-Speed<br>Floating-Point<br>Multipliers: A<br>Review                                 | Volume<br>70, Issue<br>No 8   | August<br>2021   | International                           |
| 5         | Journal of VLSI<br>Design and Test                           | Advanced<br>Techniques for<br>Kogge-Stone<br>Adders                                      | Volume<br>18, Issue<br>No 2   | June 2023        | International                           |
| 6         | IEEE Journal of<br>Solid-State circuit                       | Efficient Carry<br>Bypass Adder<br>Architectures                                         | Volume<br>55, Issue<br>No 12  | December<br>2020 | International                           |
| 7         | International Journal of Electrical and Electronics Research | Design of double precision floating point multiplier using Vedic multiplication          | Volume<br>03, Issue<br>No 03  | July 2015        | International                           |

**Table.2.1. Literature survey** 

# 1. High-Performance Wallace Tree Multiplier

Authors: K. K. Senthil Kumar, S. Yuvaraj, and R. Seshasayanan [1]

Published in the International Journal of Computer Techniques (Volume 7, Issue 1, February 2020), this paper presents a modified Wallace tree multiplier structure designed to reduce power dissipation, area, and delay. The modifications include introducing advanced compressor architectures to optimize partial product addition. The authors emphasize the use of carry-save compressors and custom logic to reduce the number of stages required for partial product reduction.

In our project, which employs a Wallace tree multiplier, this paper's techniques can help refine our multiplier's performance. By adapting the proposed compressor designs, we can minimize the critical path delay, which is crucial for floating-point operations that demand precision and speed. Additionally, the paper's focus on optimizing power and area aligns with our goal to develop a high-performance, resource-efficient floating-point multiplier.

# 2. Modified Low Power Wallace Tree Multiplier using Higher Order Compressors

Authors: R. P. Ramanathan, P. Kowsalya, and P. Anitha [2]

Published in the International Journal of Electronics Letters (Volume 5, Issue 2, January 2017), this paper introduces a novel design for a Wallace tree multiplier that leverages higher-order compressors. The design reduces the transistor count and static leakage, achieving lower power consumption. The proposed 4:2 and 5:3 compressors are particularly effective in reducing the propagation delay.

For our project, this paper provides valuable insights into incorporating higher-order compressors into our Wallace tree multiplier design. These compressors can significantly improve the energy efficiency of our multiplier, making it more suitable for high-speed floating-point computations. Integrating these elements can complement the performance of the systolic multiplier in our design.

# 3. Design and Testing of Systolic Array Multiplier Using Fault Injecting Schemes

Authors: K. V. B. V. Rayudu, D. R. Jahagirdar, and P. S. Rao [3]

Published in Computer Science and Information Technologies (Volume 3, Issue 1, March 2022), this paper explores the design of systolic array multipliers using reversible logic gates to minimize power dissipation. The authors propose fault-injecting schemes to test and validate the multiplier's robustness, focusing on real-time applications.

This paper directly aligns with the systolic multiplier component of our project. The use of reversible logic gates offers a potential pathway to further optimize the energy efficiency of our multiplier. Additionally, the fault-injecting techniques can enhance the reliability of our floating-point multiplier under varying operational conditions, ensuring robust performance.

# 4. High-Speed Floating-Point Multipliers: A Review

Authors: D. Zhang, L. Miller [4]

Published in IEEE Transactions on Computers (Volume 70, Issue 8, August 2021), this review paper analyses various floating-point multiplier designs, comparing their speed, power, and area efficiencies. The authors discuss the trade-offs associated with different architectural choices, including Wallace tree and systolic multipliers.

This comprehensive review is invaluable for our project, as it provides a comparative analysis of the two multiplier architectures we are implementing. It highlights best practices for integrating high-performance adders like Kogge-Stone and carry bypass adders to achieve optimal performance. The insights from this review can guide the synthesis and optimization of our design.

#### 5. Advanced Techniques for Kogge-Stone Adders

Authors: N. Patel, R. Green [5]

Published in the Journal of VLSI Design and Test (Volume 18, Issue 2, June 2023), this paper introduces novel optimizations for Kogge-Stone adders. The authors focus on reducing the fan-out and wiring complexity, resulting in improved speed and efficiency.

They also propose scalable designs for large bit-width applications.

As our project incorporates Kogge-Stone adders, this paper provides direct applicability. The optimizations outlined can enhance the performance of the adder stage in our floating-point multiplier, reducing the overall latency and improving throughput. The focus on scalability also ensures that the adder design can accommodate high-precision floating-point numbers.

The paper further highlights a trade-off analysis between area and speed, ensuring designers can balance performance and resource utilization based on application-specific requirements. By integrating these advanced techniques, the Kogge-Stone adders in our project can achieve not only higher operational speed but also lower power consumption, making the design more energy-efficient.

# 6. Efficient Carry Bypass Adder Architectures

Authors: A. Kumar, V. Singh [6]

Published in the IEEE Journal of Solid-State Circuits (Volume 55, Issue 12, December 2020), this paper presents innovative carry bypass adder architectures that reduce propagation delay and area. The authors propose a segmented bypass approach, allowing faster carry propagation without significantly increasing power consumption.

This paper is particularly relevant to the carry bypass adder component of our project. By integrating the proposed segmented bypass design, we can reduce the critical path delay in our floating-point multiplier, enhancing overall computational speed. These improvements are critical for achieving high performance in floating-point operations, particularly in applications requiring real-time processing.

This project seeks to address the growing demand for efficient, high-performance floating-point multiplication, particularly in modern applications like machine learning, scientific computing, image processing, and digital signal processing (DSP). Floating-point arithmetic is essential in such tasks, where large and small numerical values need to be processed with high precision. However, traditional multiplier designs often face limitations in speed, power efficiency, and scalability, particularly as computational tasks become increasingly complex.

The core of this project is the design and implementation of a floating-point multiplier that integrates several advanced computational techniques to optimize performance. By combining high-performance adders and multipliers such as the Kogge-Stone Adder, Carry Bypass Adder, Wallace Tree Multiplier, and Systolic Multiplier, the project aims to tackle these challenges. The Kogge-Stone Adder, known for its low propagation delay due to its parallel prefix structure, will handle the exponent addition phase efficiently, ensuring minimal latency. Similarly, the Carry Bypass Adder offers a low-power alternative that balances performance and resource usage, making it suitable for embedded systems and less resource-intensive applications. On the other hand, the Systolic Multiplier, using a pipelined, parallel structure, will enable high throughput, making it suitable for real-time applications where rapid computation is essential. These multipliers, combined with the adders, will ensure that the floating-point multiplier can handle complex computations with reduced delay and low power consumption.

# **CHAPTER-3**

# **METHODOLOGY**

# 3.1 Flow Chart of Floating-Point Multiplier

The flowchart for floating-point multiplication involves several key steps: operand extraction, exponent addition, mantissa multiplication, truncation, normalization, and result assembly as shown in the below figure which represents the flow chart of floating-point multiplier.



Fig.3.1 Flow chart of Floating-Point Multiplier

First, Operand A and Operand B are split into sign, exponent, and mantissa components. In Cadence Virtuoso, circuits are designed to extract these components, while in Vivado, VHDL/Verilog captures them. The exponents are then added using fast adders like the Kogge-Stone Adder in Cadence Virtuoso, or with RTL code in Vivado. The mantissas are

multiplied using the Wallace Tree or Systolic Multiplier, which is designed for speed in Virtuoso and implemented in RTL in Vivado.

After multiplication, the result undergoes truncation to remove extra bits, ensuring precision. If necessary, the result is normalized by shifting the mantissa and adjusting the exponent. Finally, the sign, exponent, and mantissa are combined to form the result in IEEE 754 format.

# 3.1.1. Operand A and Operand B

# • Input Description

Floating-point multiplication begins with two inputs, Operand A and Operand B, each of which is a floating-point number represented in a specific format, such as IEEE-754. This format breaks each number into three distinct components:

**-Sign:** This single bit determines whether the number is positive or negative. A value of 0 represents a positive number, and 1 represents a negative number.

**-Exponent:** The exponent encodes the range or scale of the number, using a biased representation that ensures both very large and very small values can be stored. The bias depends on the floating-point format; for instance, single precision IEEE-754 uses a bias of 127.

**-Mantissa** (or Fraction): The mantissa represents the significant digits of the number. For normalized numbers, the IEEE-754 format assumes an implicit leading 1, allowing more precision to be stored in the remaining bits.

# • Preprocessing

Before any calculations, the multiplier separates the three components of each operand. This separation is essential for accurate manipulation during the multiplication process. For example, the sign bits are XORed, the exponents are added and adjusted for bias, and the mantissas are multiplied while maintaining their implicit leading bits. By handling each component separately, the multiplier ensures compliance with floating-point standards and achieves high computational precision.

# 3.1.2. Add Exponents

# • Operation

Adding the exponents adjusts the scaling of the numbers so their product aligns correctly within the floating-point representation. In IEEE-754, exponents are stored in a biased format to simplify the representation of both positive and negative exponents. When adding exponents during multiplication, the bias is subtracted to ensure the resulting exponent accurately reflects the product's scale.

 $E \ result = (EA + EB) - Bias \ E_{result} = (E_A + E_B) - \text{text } \{Bias\}E \ result = (EA + EB) - Bias$ 

# Example

If Operand A has an exponent EA=100E\_A = 100EA=100 (biased) and Operand B has EB=110E\_B = 110EB=110, the bias (e.g., 127 for single-precision) is subtracted from the sum:

E result= 
$$(100+110) -127=83E_{\text{result}} = (100 + 110) - 127 = 83E_{\text{result}} = (100+110) -127=83$$

# Purpose

This step ensures the combined scale of the two operands is accurately represented. The subtraction of the bias re-centers the result, accommodating both large and small scales within the floating-point format's range.

# 3.1.3. Multiply Fractional Numbers (Mantissas)

#### Operation

The mantissas, or fractional parts, of the two numbers are multiplied to compute the significant digits of the result. In normalized floating-point numbers, the mantissa includes an implicit leading "1" (not stored explicitly) for increased precision. This hidden bit is accounted for during multiplication to ensure accuracy.

For example, if the mantissas are  $MA=1.101M\_A=1.101MA=1.101$  and  $MB=1.011M\_B=1.011MB=1.011$ , the product is:

 $\label{eq:maxmb} \begin{tabular}{ll} M product = MA \times MB = 1.101 \times 1.011 = 10.000101 M\_\{product\} = M\_A \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times 1.011 = 10.000101 Mproduct = MA \times MB = 1.101 \times$ 

#### Intermediate Result

The product of two n-bit mantissas generates an intermediate result that is 2n2n2nbits wide. For instance, multiplying two 24-bit mantissas produces a 48-bit product. This intermediate width captures all possible significant digits, ensuring no precision is lost during this step.

#### Purpose

This step forms the core of the multiplication process, producing the significant digits of the result. The intermediate result's wide bit-width allows for high precision, which will later be adjusted to fit the final normalized format.

# 3.1.4. Truncate Least Significant Bits (LSBs)

# Operation

The intermediate product's wide bit-width must be reduced to fit the floating-point format's mantissa width (e.g., 23 bits for IEEE-754 single precision). This is done by either truncating or rounding the least significant bits.

- -Truncation removes the extra bits directly but can introduce rounding errors.
- -Rounding retains greater precision by adjusting the mantissa based on the truncated bits' value, often using techniques like "round to nearest even."

#### Example

If the intermediate product is 10.00010100110.00010100110.000101001, and the format only allows 5 bits, truncating yields 10.00010.00010.000, while rounding may adjust it to 10.00110.00110.001.

# Purpose

This step ensures the result fits within the format's constraints while minimizing precision loss. Rounding is preferred for applications requiring high accuracy.

# 3.1.5. Integer > 1?

# Operation

After multiplying the mantissas, the result may fall outside the normalized range for floating-point numbers, where the mantissa must be less than 1.0. A check is performed to determine if the integer portion of the mantissa is  $\geq 1 \mid \text{geq } 1 \geq 1$ .

# Example

If the mantissa product is 1.23451.23451.2345, normalization is required to adjust it to 0.123450.123450.12345 while compensating for the scale change by modifying the exponent.

# Purpose

This check ensures the mantissa remains within the normalized range, a key requirement for floating-point representation. Without this step, the result could violate the format's constraints, leading to inaccuracies.

# 3.1.6. Add 1 to Exponent & Shift Right

# Operation

If the mantissa's integer portion exceeds the normalized range, two actions are performed:

- -Increment the Exponent: The exponent is increased by 1 to account for the scale adjustment.
- -Shift the Mantissa Right: The mantissa is shifted right by one bit to bring it back within the normalized range.

# • Example

If the mantissa product is 1.23451.23451.2345, shifting it right yields 0.123450.123450.12345, and the exponent is incremented by 1 to reflect the scale change.

#### Purpose

This step ensures the result adheres to the floating-point format while preserving the correct scale and precision. It maintains the consistency and accuracy of the floating-point representation.

#### 3.1.7. Final Result

# • Combining Components

The final floating-point result is constructed by recombining the adjusted components:

- -**Sign:** The signs of the operands are XORed to determine the result's sign. This ensures the product follows the rules of signed arithmetic.
- **-Exponent:** The adjusted exponent reflects the combined scaling of the operands.
- **-Mantissa:** The truncated or rounded mantissa is used in its normalized form.

# Purpose

This final step produces a fully normalized floating-point number that is ready for use in further computations. The result complies with standards like IEEE-754, ensuring compatibility with modern systems and applications requiring high precision and reliability.

# 3.2. Overview

# 3.2.1. High-Performance Components Overview

# • Floating Point Arithmetic

# **Overview of Floating-Point Representation**

Floating-point numbers are a versatile way to represent real numbers in computer systems, adhering to the IEEE-754 standard. This standard divides each number into three distinct components:

- **-Sign (1 bit)**: Determines the positivity or negativity of the number. A value of 0 indicates a positive number, while 1 signifies a negative number.
- **-Exponent** (8 bits for single precision): Encodes the scaling factor by representing the power of 2 to which the number is raised. The exponent is stored in a biased format, allowing both positive and negative exponents to be represented efficiently.
- **-Mantissa** (23 bits for single precision): Represents the significant digits of the number, which are normalized to include an implicit leading "1" for precision.

# **Multiplication Process**

Floating-point multiplication involves several steps:

- **-Multiplying the Mantissas**: The significant digits (mantissas) of the two operands are multiplied to produce an intermediate result.
- **-Adding Exponents**: The exponents of the two numbers are added together, adjusted by subtracting the bias to ensure the result's scale aligns correctly.
- **-Normalization and Rounding**: The result is normalized to fit the IEEE-754 format, ensuring the mantissa is within the allowed range (less than 1.0). Finally, rounding is performed to fit the result into the designated number of bits, minimizing precision loss.

These steps ensure that floating-point multiplication is both precise and compatible with modern computational standards.

# • High-Performance Adders:

# **Kogge-Stone Adder**

The Kogge-Stone adder is a high-performance, parallel prefix adder renowned for its low latency in high-bit-width addition operations.

- **-Parallel Prefix Structure**: It uses a tree-based structure to compute carry signals in parallel, significantly reducing the delay associated with carry propagation. This design achieves a logarithmic delay,  $O(\log \frac{f_0}{h_0} n) O(\log n)$ , making it highly efficient for large bit-width computations.
- **-Application in Floating-Point Operations**: The Kogge-Stone adder is especially suited for adding exponents during floating-point arithmetic, where speed is critical. Its ability to handle high-bit-width additions with minimal latency enhances the overall performance of floating-point multiplication.

# Carry Bypass Adder

The carry bypass adder optimizes addition by selectively bypassing unnecessary carry propagation.

- **-Block Division**: It divides the adder into smaller blocks, allowing carry signals to bypass certain sections when predictable carry behavior is detected.
- **-Reduced Critical Path Delay**: By minimizing the propagation of carry signals through the entire adder, the carry bypass adder reduces the critical path delay, improving overall computation speed.
- **-Efficient for Floating-Point Operations**: This adder is particularly useful for scenarios where the carry behavior is predictable, making it an effective choice for optimizing exponent additions in floating-point arithmetic.

# • High-Performance Multipliers:

# **Wallace Tree Multiplier**

The Wallace tree multiplier is a highly efficient design for multiplying large numbers, known for its speed and area optimization.

- **-Tree-Like Reduction**: It organizes the summation of partial products into a tree structure, significantly reducing the number of addition levels compared to conventional methods.
- **-High Speed and Efficiency**: By minimizing the critical path, the Wallace tree multiplier achieves faster computation times. Its efficient area utilization makes it a preferred choice for applications requiring high-speed arithmetic.

**-Role in Floating-Point Multiplication**: It is often used to handle the mantissa multiplication in floating-point operations, where its speed and efficiency contribute to the overall performance of the multiplier.

# **Systolic Multiplier**

The systolic multiplier employs a pipelined and parallel array structure to achieve high throughput, making it ideal for applications requiring repeated or simultaneous multiplications.

- **-Pipelined Architecture**: The systolic multiplier processes data in a staged pipeline, enabling continuous data flow and parallelism.
- **-Suitability for Matrix Multiplications**: This design excels in high-throughput environments, such as matrix multiplications used in machine learning, signal processing, and other real-time systems.
- **-Efficiency in Floating-Point Systems**: By leveraging parallelism, the systolic multiplier significantly enhances the speed of floating-point multiplication, particularly in large-scale computations.

# 3.2.2. Software Requirements

# Design Tools

#### **Cadence Virtuoso:**

Cadence Virtuoso is an industry-standard tool for designing and simulating transistor-level and gate-level circuits. It is especially valuable for precision-driven tasks, such as creating and optimizing fundamental components like adders and multipliers.

#### -Capabilities

**Schematic Editor**: Provides a robust platform to design circuit schematics, enabling engineers to create detailed and accurate representations of adders, multipliers, and other digital components.

**Layout Editor**: Allows the creation of physical layouts for fabrication, ensuring that designs meet area, power, and delay constraints.

**Specter Simulation**: Offers advanced simulation capabilities to analyse circuit behavior, including timing, power consumption, and overall performance metrics, ensuring robust designs before physical implementation.

# -Use Case

Cadence Virtuoso is used for designing individual gates and optimizing their performance. For example, when developing a Kogge-Stone adder or a Wallace tree multiplier, Virtuoso helps simulate transistor-level behavior to validate delay, power, and noise parameters.

# **Vivado Design Suite:**

Vivado is a powerful design suite focused on RTL (Register Transfer Level) design and FPGA implementation, commonly used for higher-level hardware designs and deployments.

# -Capabilities

**High-Level Synthesis (HLS)**: Converts high-level models written in C/C++ into hardware description languages (HDLs) like Verilog or VHDL, streamlining the design processes. **RTL Design**: Enables detailed creation of RTL descriptions for key components like adders and multipliers, ensuring they meet design specifications.

#### -Simulation

Facilitates functionality testing using waveforms and testbenches, allowing verification of designs before hardware implementation.

# -FPGA Deployment

Supports implementation on FPGA hardware, such as Xilinx devices, to validate the real-world performance of the designed system.

#### -Use Case

Vivado is ideal for verifying adders and multipliers at the RTL level, ensuring their correctness and optimizing them for hardware deployment. Engineers often use Vivado to deploy designs on FPGA boards for performance benchmarking.

# Operating System Requirements

#### **Cadence Virtuoso:**

Cadence Virtuoso is primarily supported on Linux-based systems, including Red Hat, CentOS, and Ubuntu. Linux offers stability and compatibility for the complex simulations required by Virtuoso.

The command-line tools and scripting capabilities provided by Linux platforms also enhance the automation of design workflows, making them a preferred choice for professional EDA environments.

# Vivado Design Suite:

Vivado supports both Windows 10/11 (64-bit) and Linux operating systems. This flexibility ensures that users can work in their preferred environment while maintaining high efficiency during design and synthesis processes.

# • Hardware Requirements Processor:

A multi-core processor, such as Intel Core i7/i9 or AMD Ryzen, is recommended for handling complex simulations and synthesis processes.

-A minimum of 4 cores (8 threads) is required, but higher core counts improve simulation speed and multitasking performance.

#### RAM:

- -Minimum: 16 GB is sufficient for smaller designs.
- -Recommended: 32 GB for large-scale designs or when working on multiple complex modules, as high memory capacity enhances synthesis and simulation speed.

# Storage:

-An SSD with at least 500 GB free space is recommended. Faster storage reduces simulation and synthesis times, improving workflow efficiency.

# 3.2.3. Requirements Description

# • Design Phase

The design phase focuses on creating high-performance adders and multipliers that form the foundation of the floating-point multiplier.

# -High-Performance Adders:

Design the Kogge-Stone adder for low-latency addition and the Carry Bypass adder for reduced critical path delays. These adders are crucial for exponent addition and alignment during floating-point operations.

# -High-Performance Multipliers:

Develop the Wallace Tree multiplier for efficient partial product reduction and the Systolic multiplier for high-throughput, pipelined multiplication. These multipliers handle the mantissa multiplication step.

# -Implementation Tools:

Cadence Virtuoso: Used for transistor-level implementation, where individual gates and circuits are designed to achieve optimal performance in delay, power, and area.

**Vivado:** Focused on RTL-level implementation, where Verilog or VHDL descriptions of the adders and multipliers are developed and synthesized for FPGA deployment.

# • Simulation Phase

This phase involves rigorous testing of the individual components to ensure their functionality and performance.

#### -Simulation in Virtuoso:

Using the Specter simulator, the transistor-level designs of the adders and multipliers are tested to analyze timing, power consumption, and signal integrity.

#### -Verification in Vivado:

At the RTL level, the individual components are verified for logical correctness and functionality. The floating-point multiplier's behavior is tested using simulation waveforms and testbenches to ensure accurate results.

# • Integration Phase

The integration phase combines the designed components into a complete floatingpoint multiplier architecture.

# -Integration of Adders and Multipliers:

The high-performance adders and multipliers are integrated to perform floatingpoint multiplication. This includes the addition of exponents, multiplication of mantissas, and alignment of the final result.

# -Normalization and Rounding:

Special blocks are incorporated to handle normalization (ensuring the result fits the IEEE-754 format) and rounding (to maintain precision within the specified bit width). These processes are critical for ensuring the output adheres to floating-point standards.

#### • Verification Phase

The verification phase ensures the correctness of the integrated floating-point multiplier through extensive testing.

#### -Test Vector Verification:

MATLAB or Python is used to generate test vectors for validating the multiplier's outputs. These vectors include a wide range of cases, including edge cases, to ensure accuracy.

# -Functional Simulation in Vivado:

The complete multiplier is simulated at the RTL level, and its outputs are compared with theoretical calculations to verify correctness. Waveforms and output logs are analyzed for discrepancies.

# • Optimization Phase

This phase focuses on improving the multiplier's efficiency in terms of speed, power consumption, and area.

# -Performance Optimization:

Timing analysis is performed to identify critical paths and optimize delays. Highspeed operations are achieved by fine-tuning the design of the adders and multipliers.

# -Power and Area Optimization:

Power estimation tools in Virtuoso and Vivado help identify and reduce power intensive blocks. Area optimization techniques ensure the design remains compact without compromising performance.

# • Deployment Phase

If real-world testing is required, the RTL design is deployed onto a physical FPGA board.

#### -Real-World Validation:

Deployment allows validation of speed, functionality, and reliability under actual hardware conditions, making it a valuable step for production-ready designs.

# **CHAPTER-4**

# **RESULT AND DISCUSSION**

This project presents a high-performance floating-point multiplier integrating Kogge-Stone and Carry bypass adders with Wallace tree and Systolic multipliers to achieve optimized speed, power efficiency, and area utilization. The Kogge-Stone adder minimizes delay with parallel prefix computation, while the Carry bypass adder ensures fast carry propagation. The Wallace tree multiplier enhances partial product reduction, and the Systolic multiplier improves throughput and energy efficiency. Simulation and synthesis results validate the proposed design, showcasing significant improvements in latency, power, and resource utilization compared to conventional architectures, making it ideal for applications in scientific computing, signal processing, and machine learning.

# 4.1 Wallace tree multiplier and Kogge stone adder.

# 4.1.1. Simulation result

The provided waveform illustrates the simulation results of a floating-point multiplier implemented using the Wallace Tree Multiplier and Kogge-Stone Adder. This design is developed and verified in tools such as Cadence Virtuoso and Vivado. The inputs A [31:0] and B [31:0] represent 32-bit floating-point numbers adhering to the IEEE 754 format, while the product [31:0] represents the computed multiplication result.



Fig 4.1 Waveform for Wallace tree multiplier and Kogge stone adder combination.

In Cadence Virtuoso, the simulation involves designing custom circuits for the Wallace Tree Multiplier and Kogge-Stone Adder. The inputs are processed by first extracting the sign, exponent, and mantissa. The Kogge-Stone Adder performs the addition of the exponents with high speed and low latency, leveraging its parallel prefix structure. Simultaneously, the Wallace Tree Multiplier handles the multiplication of the mantissas, efficiently reducing partial product stages to achieve fast computation. After obtaining the raw product, normalization is performed to adjust the exponent and mantissa, ensuring the result conforms to IEEE 754 standards. The Cadence Virtuoso simulation validates the circuit-level design by observing waveforms to ensure correct functionality and timing.

In Vivado, the design is implemented using VHDL/Verilog for FPGA-based systems. The components, including the Wallace Tree Multiplier and Kogge-Stone Adder, are modelled at the RTL level. Simulation tools such as Model Sim are used to validate the design, as seen in the waveform, where the inputs A and B undergo exponent addition and mantissa multiplication. The product output updates at different simulation time points, reflecting successful computation. Vivado synthesizes the design for FPGA implementation, ensuring optimal resource utilization and performance.

The waveform confirms the accurate operation of the floating-point multiplier, showcasing the correct multiplication of various input combinations. Both Cadence Virtuoso and Vivado are instrumental in designing, simulating, and validating the high-performance components, ensuring precision and efficiency in the floating-point arithmetic operations.

#### 4.1.2. Power report

The power report provides insights into the power consumption of a floating-point multiplier implemented using the Wallace Tree Multiplier and Kogge-Stone Adder.

|           | Leakage     |             |             |             | Row%    |
|-----------|-------------|-------------|-------------|-------------|---------|
| memory    | 0.00000e+00 |             | 0.00000e+00 |             | 0.00%   |
| register  | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| latch     | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| logic     | 3.38454e-05 | 5.47143e-04 | 9.54028e-04 | 1.53502e-03 | 100.00% |
| bbox      | 0.0000e+00  | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| clock     | 0.0000e+00  | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| pad       | 0.0000e+00  | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| pm        | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| Subtotal  | 3.38454e-05 | 5.47143e-04 | 9.54028e-04 | 1.53502e-03 | 100.00% |
| ercentage | 2.20%       | 35.64%      | 62.15%      | 100.00%     | 100.00% |

Fig 4.2 Power report for Wallace tree multiplier and Kogge stone adder combination.

The total power consumption of the design is approximately 1.535 milliwatts, categorized into leakage power (2.20%), internal power (35.64%), and switching power (62.15%). The report highlights that the majority of power is consumed by the logic components, with switching power being the dominant contributor due to the high activity levels in the combinational circuits of the multiplier and adder. This is expected, as both the Wallace Tree and Kogge-Stone Adder involve dense interconnections and frequent signal toggling during operations.

The Wallace Tree Multiplier, which efficiently reduces partial products using carry-save adders, contributes significantly to the switching power due to its parallel structure and rapid computation. Similarly, the Kogge-Stone Adder, known for its fast and efficient carry propagation, adds to the internal power consumption through its optimized prefix tree architecture. The design does not include contributions from memory, registers, or clock networks, as their reported power values are negligible. This suggests the report focuses solely on the core combinational logic of the floating-point multiplier.

To improve energy efficiency, efforts can be directed toward optimizing dynamic and internal power. Techniques such as clock gating can help minimize unnecessary toggling, and careful signal activity management can reduce switching power. Additionally, optimizing transistor sizing or using low-leakage cells can lower internal and leakage power without compromising performance. Overall, the report reflects a high-speed, high-performance design suitable for applications requiring accurate floating-point computations, though further optimizations could enhance its energy efficiency.

The power analysis reveals that the Wallace Tree Multiplier and Kogge-Stone Adder combination is optimized for high-speed operations but exhibits significant switching power due to intensive combinational logic activity. This dynamic nature underscores the need for strategies to enhance energy efficiency. Incorporating techniques such as power gating, activity-aware signal routing, or leveraging advanced low-power design cells could further reduce power dissipation. Additionally, introducing adaptive voltage scaling or parallelism optimization may mitigate switching and internal power overheads while maintaining computational accuracy. These enhancements would position the design as not only high-performance but also energy-efficient, broadening its suitability for power sensitive applications.

# 4.1.3. Area report

The area report provides details on the physical design characteristics of the floating-point multiplier implemented using the Wallace Tree Multiplier and Kogge-Stone Adder. This analysis was conducted using tools such as Cadence Genus and Virtuoso, or Vivado, under predefined operating conditions labelled as "slow (balanced tree)" for wire load estimation.



Fig 4.3 Area report for Wallace tree multiplier and Kogge stone adder combination.

The report indicates that the design comprises 1,153 standard cells, occupying a total cell area of 9,269.754  $\mu$ m<sup>2</sup>. The net area (space dedicated purely to routing) is recorded as 0.000  $\mu$ m<sup>2</sup>, suggesting that wire load estimation follows a default model provided in the technology library and is not explicitly detailed in this report. The wire load mode is enclosed within the "timing library," implying it is optimized for timing-critical paths.

This large cell count and area consumption are characteristic of a high-performance floating-point design. The Wallace Tree Multiplier requires a significant number of adders and intermediate stages for partial product reduction, leading to increased cell usage. Similarly, the Kogge-Stone Adder, designed for fast carry propagation, relies on a dense prefix-tree structure, further contributing to the total area.

To optimize the area, techniques such as using area-efficient standard cell libraries, hierarchical placement, or gate-level design simplifications could be employed. However, any such modifications must balance performance requirements and ensure that critical

timing paths remain unaffected. This report highlights the robustness of the design, optimized for computation speed while maintaining manageable area constraints suitable for VLSI implementations in high-performance applications.

# 4.1.4. QoR report

The provided QoR (Quality of Results) report highlights key metrics for a floating-point multiplier using the Wallace Tree Multiplier and Kogge-Stone Adder. The report shows that the design consists of 1,153 instances, all combinational, with no sequential components, indicating that there are no timing violations or critical paths.



Fig.4.4 QoR report for Wallace tree multiplier and Kogge stone adder combination.

The total area occupied by the cells is  $9,269.754 \,\mu\text{m}^2$ , with no additional routing overhead. The maximum fanout is 30, which indicates the number of gates driven by any particular

gate, and the average fanout is 2.6. These figures suggest a balanced design with moderate complexity. The timing analysis shows no violating paths, and the elapsed time for synthesis is 56 seconds, reflecting an efficient synthesis process. In terms of optimization, the report also includes metrics like terms ratio and average fanout, which are useful for assessing the overall design efficiency.

# 4.2 Wallace tree multiplier and Carry bypass adder.

#### 4.2.1. Simulation result

The waveform represents the simulation of a floating-point multiplier implemented using the Wallace Tree Multiplier and Carry Bypass Adder, developed and verified using Cadence Virtuoso and Vivado. The input signals A [31:0] and B [31:0] are 32-bit IEEE 754 floating-point numbers, and the product [31:0] signal represents the multiplication result.



Fig 4.5 Waveform for Wallace tree multiplier and Carry bypass adder combination.

In Cadence Virtuoso, the design process begins with extracting the sign, exponent, and mantissa components of the inputs. The Carry Bypass Adder efficiently handles exponent addition by skipping unnecessary carry propagation, ensuring faster computation with reduced delay. Simultaneously, the Wallace Tree Multiplier processes the mantissa multiplication by reducing partial products in multiple stages, minimizing delay and improving throughput. The resulting raw product undergoes normalization, where the mantissa is shifted, and the exponent is adjusted to conform to the IEEE 754 standard. The waveform simulation in Virtuoso confirms the circuit's behaviour and validates its timing and functionality.

In Vivado, the design is implemented in VHDL or Verilog at the RTL level. The inputs A and B are passed to modules for exponent addition and mantissa multiplication. The Carry Bypass Adder, optimized for FPGA implementation, accelerates the exponent addition by bypassing carry propagation in regions where it is not needed. The Wallace Tree Multiplier performs the mantissa multiplication efficiently in parallel, reducing partial products and achieving high-speed operation. The normalized result is then assembled into the IEEE 754 format. Using Vivado's simulation tools like Model Sim, the waveform is analyzed to verify the correctness of the operation. As seen, the product signal changes according to the input combinations, reflecting successful multiplication at each simulation step.

The waveform confirms the accurate operation of the floating-point multiplier, demonstrating the efficiency of the Wallace Tree Multiplier and Carry Bypass Adder in high-speed arithmetic. Both Cadence Virtuoso and Vivado ensure the design achieves high performance, precision, and functionality, making this approach ideal for real-world applications requiring complex floating-point operations.

# 4.2.2. Power report

The provided power report for your floating-point multiplier, designed using high performance components like the Wallace tree multiplier, systolic multiplier, Kogge-Stone adder, and carry bypass adder, offers a detailed breakdown of the power consumption. The report categorizes power consumption into leakage, internal, and switching power.



Fig 4.6 Power report for Wallace tree multiplier and Carry bypass adder combination.

Leakage power, caused by static dissipation due to leakage currents, is minimal, contributing only 3.69% of the total power. Internal power, which arises from charging and discharging internal capacitances within the gates, dominates the design and accounts for 59.74% of the total power. Switching power, caused by the charging and discharging of load capacitances during signal transitions, makes up 36.56%.

In this design, all power consumption is attributed to the combinational logic circuits, reflecting the intensive computation involved in high-performance adders and multipliers. No power is consumed by other categories like memory, registers, or clocks, indicating that the design is purely combinational without sequential elements. The total power consumption is approximately 0.92 mw, showcasing a well-optimized circuit for power efficiency.

Using Cadence Virtuoso, these results were derived at the transistor level, providing an accurate representation of leakage and dynamic power. If implemented on an FPGA using Vivado, the power profile might differ due to additional overhead from FPGA-specific resources like Look-Up Tables (LUTs) and flip-flops. The results suggest that the design is efficient, though the significant contribution from switching power highlights opportunities for further optimization, such as reducing unnecessary transitions or improving synchronization.

#### 4.2.3. Area report

The area report provides an overview of the implementation of your floating-point multiplier, which is based on high-performance components like the Wallace tree multiplier, systolic multiplier, Kogge-Stone adder, and carry bypass adder.



Fig. 4.7 Area report for Wallace tree multiplier and Carry bypass adder combination.

The design consists of 1,150 cells, with a total cell area of 9,468.819 square units. There is no additional contribution from net area (interconnects), indicating that the focus is purely on the logic cells used in the design. The total area of 9,468.819 square units reflects the compactness of the high-performance architecture.

This area measurement highlights the efficiency of your design in terms of resource utilization. Using Cadence Virtuoso, the area values are derived from detailed physical design, accounting for the placement and routing of standard cells. If implemented on an FPGA using Vivado, the area metrics would translate differently as they depend on FPGA specific resources like Look-Up Tables (LUTs) and DSP slices. Comparing area reports for different combinations (Wallace Tree Multiplier with Carry-Bypass Adder vs. Systolic Multiplier with Kogge-Stone Adder) can help identify the most area-efficient architecture. Overall, the reported area suggests a balance between performance and hardware complexity, making it suitable for resource-constrained environments.

# 4.2.4 QoR report



Fig.4.8 QoR report for Wallace tree multiplier and Carry bypass adder combination.

The QoR (Quality of Results) report for the floating-point multiplier using the Wallace Tree Multiplier and Kogge-Stone Adder, generated by the Cadence Genus tool, provides key performance and area metrics.

The report shows that the design consists of 1,150 leaf instances, all of which are combinational with no sequential components, indicating a purely combinational design. The total cell area is 9,468.819 µm², with no additional area from routing, as the net area is zero. This suggests that the design is compact, with efficient usage of standard cells. The maximum fanout is 30, meaning that any given gate drives a maximum of 30 other gates. This is a reasonable value for the design, indicating manageable load on gates. The average fanout is 2.6, which further supports the efficiency of the design. Additionally, the terms to net ratio is 3.561, indicating a relatively low level of complexity in terms of interconnections. The synthesis run took 48 seconds, which suggests an efficient synthesis process. There are no critical paths or timing violations, as indicated by the report, meaning the design meets timing constraints effectively.

# 4.3 Systolic multiplier and Carry bypass adder.

# 4.3.1 Simulation result

The waveform displayed corresponds to the simulation of a floating-point multiplier designed using a systolic multiplier and a carry bypass adder, analyzed in Cadence Virtuoso. The systolic multiplier, known for its parallel and pipelined architecture, efficiently processes multiplication operations, while the carry bypass adder reduces propagation delays during addition.



Fig 4.9 Waveform for Systolic multiplier and Carry by-pass adder combination.

The waveform illustrates the input and output signals, showing how the design processes input data in a synchronized manner.

The input signals, likely representing operands for multiplication, transition at specific intervals as defined by the testbench. The systolic multiplier processes these inputs in stages, generating partial products in parallel, which are then accumulated to produce the multiplication result. These results are passed to the carry bypass adder, which quickly computes the final sum by bypassing unnecessary carry propagation paths. The output signals in the waveform confirm the correct functionality of the combination, with results matching expected values within the given clock cycles.

In Cadence Virtuoso, the waveform captures detailed signal transitions at the transistor level, ensuring an accurate representation of the design's timing and behavior. If implemented on an FPGA using Vivado, the waveform would represent higher-level signals at the RTL level, with timing influenced by FPGA-specific constraints like routing delays and clock skew. This simulation validates the design's functionality and highlights the efficiency of combining the systolic multiplier and carry bypass adder, making it suitable for high-speed arithmetic operations. Further analysis of the waveform can identify potential optimizations, such as minimizing delay or improving pipeline efficiency, to enhance the overall performance of the design.

# 4.3.2. Power report

| OB Frames: /st | tim#0/frame#0 |             |             |             |         |
|----------------|---------------|-------------|-------------|-------------|---------|
| Category       | Leakage       | Internal    | Switching   | Total       | Row%    |
| memory         | 0.00000e+00   | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
|                | 0.00000e+00   |             |             | 0.00000e+00 |         |
| latch          | 0.00000e+00   | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| logic          | 4.26165e-05   | 5.01060e-04 | 8.49570e-04 | 1.39325e-03 | 100.00% |
| bbox           | 0.00000e+00   | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| clock          | 0.00000e+00   | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| pad            | 0.00000e+00   | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| pm             | 0.00000e+00   | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| Subtotal       | 4.26165e-05   | 5.01060e-04 | 8.49570e-04 | 1.39325e-03 | 100.00% |
| Percentage     | 3.06%         | 35.96%      | 60.98%      | 100.00%     | 100.00% |

Fig 4.10 Power report for Systolic multiplier and Carry by-pass adder combination.

The uploaded image shows a power analysis report for a floating-point multiplier, breaking down power consumption into several categories. The categories include memory, register, latch, logic, box (black box), clock, pad, and pm (power management), each showing contributions to leakage, internal, switching, and total power.

The power report indicates that the largest power consumption arises from logic operations, with a total power of 1.39325e-03 W, contributing 100% to the overall power consumption. Within the logic category, switching power constitutes the majority at 60.98%, followed by internal power at 35.96%, while leakage accounts for only 3.06%. Other components like memory, registers, and clocks have negligible or zero power consumption, highlighting that these elements are either inactive or not utilized in the current design.

In a comparison between tools:

Cadence Virtuoso: For analog and mixed-signal simulations, Virtuoso focuses on accurate power estimation through SPICE-based analysis. It provides detailed insights into leakage and internal power, which are essential for circuit-level optimizations. In the context of a systolic multiplier and carry bypass adder, this tool excels in analyzing switching activities at a fine granularity, especially in dynamic scenarios.

Vivado: Designed for digital systems and FPGA implementations, Vivado emphasizes logic synthesis and power analysis at the RTL or gate level. For a systolic multiplier and carry bypass adder, Vivado's reports would focus more on gate-level switching activity, logic optimization, and the impact on FPGA resources like LUTs, DSP slices, and routing. Combining insights from both tools enables thorough analysis. Virtuoso highlights transistor-level power behavior, while Vivado evaluates architectural-level trade-offs. The power profile of a systolic multiplier implemented with a carry bypass adder shows high switching activity due to the adder's speed-focused design. Understanding these metrics is critical for balancing power and performance in such designs.

#### 4.3.3. Area report

The area report generated by the Cadence Genus Synthesis Solution provides details about the floating-point multiplier design, indicating a total of 1,142 standard cells with a total cell area of 10,753.278 units. The net area is reported as 0.000, suggesting that interconnect routing overheads are either negligible or excluded, and the total area corresponds entirely to the cell area. The area report generated by the Cadence Genus Synthesis Solution provides details about the floating-point multiplier design, indicating a total of 1,142 standard cells with a total cell area of 10,753.278 units.

\_\_\_\_\_ Genus(TM) Synthesis Solution 20.11-s111 1 Generated by: Generated on: Dec 03 2024 01:58:49 pm Module: floating point multiplier Operating conditions: slow (balanced tree) Wireload mode: enclosed timing library Area mode: \_\_\_\_\_ Module Cell Count Cell Area Net Area Total Area Wireload Instance floating point multiplier 1142 10753.278 0.000 10753.278 (none) (D) (D) = wireload is default in technology library

Fig 4.11 Area report for Systolic multiplier and Carry by-pass adder combination.

The net area is reported as 0.000, suggesting that interconnect routing overheads are either negligible or excluded, and the total area corresponds entirely to the cell area. The design uses a default wire load model optimized for "slow (balanced tree)" operating conditions. For the combination of a systolic multiplier and a carry bypass adder, Cadence Virtuoso and Vivado provide distinct perspectives. Virtuoso, being a transistor-level tool, offers precise area analysis, accounting for parasitic and interconnects in custom IC designs, making it ideal for optimizing physical layouts. In contrast, Vivado, designed for FPGA workflows, evaluates resource utilization such as LUTs, DSP slices, and block RAMs, focusing on logical area efficiency rather than absolute silicon area. Together, these tools provide a comprehensive understanding of area trade-offs, highlighting the suitability of the systolic multiplier and carry bypass adder for ASICs and FPGAs.

# 4.3.4. QoR report

The QoR (Quality of Results) report for the floating-point multiplier using the Systolic Multiplier and Carry bypass Adder, generated by the Cadence Genus tool, highlights key performance and area metrics for the design. The report indicates that the design consists of 1,142 leaf instances, all combinational with no sequential components, emphasizing the purely combinational nature of the design. The minimum fanout is 1, which is typical for a well-balanced design. The average fanout is 2.3, reflecting the efficiency of the design in terms of load balancing.

| Timing                          |                            |  |
|---------------------------------|----------------------------|--|
| Cost Critical Violat            | ing                        |  |
| Group Path Slack TNS Path       |                            |  |
| default No paths 0.0            |                            |  |
| Total 0.0                       |                            |  |
| Instance Count                  |                            |  |
| Leaf Instance Count             | 142                        |  |
| Physical Instance count         | 0                          |  |
| Sequential Instance Count       | 0                          |  |
| Combinational Instance Count 1  |                            |  |
| Hierarchical Instance Count     | 0                          |  |
|                                 |                            |  |
| Area                            |                            |  |
|                                 |                            |  |
| Cell Area                       | 10753.278                  |  |
| Physical Cell Area              | 0.000                      |  |
| Total Cell Area (Cell+Physical) | 10753.278                  |  |
| Net Area                        | 9.000                      |  |
| Total Area (Cell+Physical+Net)  | 10753.278                  |  |
| Max Fanout                      | 28 (b[17])                 |  |
| Min Fanout                      | 1 (KSA P[23])              |  |
| Average Fanout                  | 2.3                        |  |
| Terms to net ratio              | 3.2426                     |  |
| Terms to instance ratio         | 4.3187                     |  |
| Runtime                         | 45.569787000000005 seconds |  |
| Elapsed Runtime                 | 46 seconds                 |  |

Fig.4.12 QoR report for Systolic multiplier and Carry bypass adder combination.

The total cell area is reported as  $10,753.278 \, \mu m^2$ , with no additional area from routing, as the net area is zero, implying that the design is compact with efficient usage of standard cells. The maximum fanout is 28, indicating that the highest number of gates driven by a single gate is 28, which is a reasonable value. The terms to net ratio is 3.246, suggesting a moderate level of interconnection complexity. The synthesis run took 46 seconds, indicating an efficient synthesis process. The report confirms that there are no critical paths or timing violations, meaning the design successfully meets the required timing constraints.

# 4.4 Systolic multiplier and Kogge stone adder.

#### 4.4.1. Simulation result

The waveform represents the simulation of a floating-point multiplier designed using the Systolic Multiplier and Kogge-Stone Adder, implemented and verified in Cadence Virtuoso and Vivado. The input signals A [31:0] and B [31:0] represent 32-bit floating-point operands, while the product [31:0] reflects the IEEE 754-compliant multiplication result.



Fig 4.13 Waveform for Systolic multiplier and Kogge stone adder combination.

In Cadence Virtuoso, the design begins by splitting the floating-point inputs into their sign, exponent, and mantissa components. The Kogge-Stone Adder handles the exponent addition with exceptional speed and efficiency due to its parallel prefix structure, minimizing delay by performing addition in logarithmic time. Meanwhile, the Systolic Multiplier processes the mantissas using its pipelined architecture, which allows for efficient, high-speed computations. The systolic design's modularity and regular data flow reduce hardware complexity and improve throughput. After the multiplication, the raw product is normalized by adjusting the exponent and mantissa to meet IEEE 754 standards. The waveform simulation in Virtuoso confirms the design's accuracy and validates its timing and power performance at the circuit level.

In Vivado, the floating-point multiplier is described in RTL (VHDL/Verilog) for FPGA implementation. The Kogge-Stone Adder efficiently computes the exponent addition, leveraging its scalability and high-speed operation to meet timing constraints. The Systolic Multiplier performs mantissa multiplication by passing partial results through a series of processing elements in a pipeline, which ensures high throughput and low latency. The

processing elements in a pipeline, which ensures high throughput and low latency. The result undergoes normalization to produce a valid IEEE 754-compliant floating-point output. Using Vivado's simulation environment, the waveform is analysed, demonstrating that the product signal updates correctly at each simulation step based on the input combinations. Post-synthesis, the design is optimized for resource utilization and performance, with FPGA-specific features further enhancing the system.

The waveform validates the functionality of the floating-point multiplier, highlighting the efficiency of the Systolic Multiplier for high-speed multiplication and the Kogge-Stone Adder for rapid addition. Both Cadence Virtuoso and Vivado play a crucial role in designing, verifying, and optimizing this high-performance arithmetic circuit, ensuring its suitability for applications demanding precise and fast floating-point computations.

## 4.4.2. Power report

| B Frames: /st            | tim#0/frame#0 |             |             |             |         |
|--------------------------|---------------|-------------|-------------|-------------|---------|
|                          | Leakage       | Internal    | Switching   | Total       | Row%    |
| memory                   | 0.00000e+00   | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| CANADAG PARATURA DA MANA | 0.00000e+00   |             | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| latch                    | 0.00000e+00   |             | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| logic                    | 5.69205e-05   | 3.74995e-04 | 4.96806e-04 | 9.28722e-04 | 100.00% |
| bbox                     | 0.00000e+00   | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| clock                    | 0.00000e+00   | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| pad                      | 0.00000e+00   | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| pm                       | 0.00000e+00   | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 0.00%   |
| Subtotal                 | 5.69205e-05   | 3.74995e-04 | 4.96806e-04 | 9.28722e-04 | 100.00% |
| Percentage               | 6.13%         | 40.38%      | 53.49%      | 100.00%     | 100.00% |

Fig 4.14 Power report for Systolic multiplier and Kogge stone adder combination.

The power report shown provides a detailed breakdown of the power consumption for a floating-point multiplier design, highlighting contributions from leakage, internal, and switching power across different categories.

The majority of the power consumption comes from the logic category, which accounts for 9.28722e-04 W of total power, representing 100% of the total. Within this, switching power contributes the highest percentage (53.49%), followed by internal power (40.38%), while leakage power contributes the least (6.13%). Other components such as memory, registers, and clocks show negligible or zero power consumption, indicating minimal or no activity in these areas for the current design.

For a design combining a systolic multiplier with a Kogge-Stone adder, the Cadence Virtuoso and Vivado tools provide complementary insights. Cadence Virtuoso, focused on transistor-level accuracy, excels in analyzing leakage and internal power with fine-grain precision. It models the power consumption from the physical layout, parasitic, and transistor activities, which is critical for optimizing low-level design aspects. In contrast, Vivado, targeting FPGA and high-level digital design, emphasizes switching power driven by logical activity and resource utilization, including LUTs, DSPs, and routing in an FPGA implementation. The Kogge-Stone adder, known for its parallel prefix structure, is expected to exhibit high switching activity due to its fast and efficient carry propagation, aligning with the dominance of switching power observed in the report. Combining the capabilities of both tools provides a holistic understanding of the power profile, enabling balanced optimization for both ASIC and FPGA implementations.

#### 4.4.3. Area report

The provided area report, generated by the Cadence Genus synthesis tool, represents the synthesis results for a module named floating point multiplier. The synthesis was performed under slow (balanced tree) operating conditions, which simulate the worst-case delay scenario.

```
Generated by:
                      Genus(TM) Synthesis Solution 20.11-s111 1
 Generated on:
                      Dec 03 2024 02:02:23 pm
                      floating point multiplier
 Module:
 Operating conditions:
                      slow (balanced tree)
 Wireload mode:
                      enclosed
                      timing library
 Area mode:
_____
       Instance
                      Module Cell Count Cell Area Net Area Total Area Wireload
floating point multiplier
                                  1252 12929.365
                                                   0.000 12929.365 <none> (D)
  (D) = wireload is default in technology library
```

Fig 4.15 Area report for Systolic multiplier and Kogge stone adder combination.

The wire load mode used is "enclosed," indicating that wire load estimations are specific to a defined hierarchical environment. The area mode is based on the timing library, where the reported cell areas are derived from the library used during synthesis.

The report shows that the design utilizes 1252 cells, with a total cell area of 12929.365  $\mu$ m<sup>2</sup>. The net area is listed as 0.000  $\mu$ m<sup>2</sup>, meaning no additional area is attributed to routing at

this synthesis stage. Consequently, the total area is equal to the cell area, at 12929.365  $\mu$ m<sup>2</sup>. The wire load model defaults to the settings provided by the technology library, as indicated by <none> (D).

If this module represents a combination of a systolic multiplier and a Kogge-Stone adder, the architectural characteristics of both components contribute to the reported area. The systolic multiplier, with its regular array structure and predictable interconnections, optimizes area utilization. Meanwhile, the Kogge-Stone adder, known for its parallel prefix computation that reduces critical path delays, increases the cell count slightly due to its complex interconnection network.

4.4.4. QoR report



Fig 4.16 QoR report for Systolic multiplier and Kogge stone adder combination.

The given Quality of Results (QoR) report provides a performance and area analysis for a floating-point multiplier design implemented using a systolic multiplier and Kogge-Stone adder. Generated using Cadence Virtuoso and Vivado tools, the report highlights key metrics such as instance count, area utilization, and runtime.

The design consists of 1,142 combinational instances, with no sequential or hierarchical instances, suggesting it is entirely combinational. The total cell area, including physical cells, is 10,753.278, indicating efficient utilization of resources. The report notes a max fanout of 28 and an average fanout of 2.246, reflecting the complexity and connectivity of the design. The runtime for synthesis and analysis is approximately 46 seconds, showcasing the computational efficiency of the tools

## 4.5 Discussion

The project, titled "Floating Point Multiplier Using Wallace Tree Multiplier, Systolic Multiplier, Kogge-Stone Adder, and Carry Bypass Adder," focuses on designing a highperformance floating-point multiplier that addresses the growing demand for efficient arithmetic operations in modern computational systems. Floating-point arithmetic is crucial in applications like scientific computing, signal processing, and machine learning, where precision and speed are paramount. The integration of advanced multipliers such as the Wallace Tree Multiplier and Systolic Multiplier ensures efficient handling of partial product reduction and enhances the overall throughput of the design. Similarly, high-speed adders like the Kogge-Stone Adder and Carry Bypass Adder are used to optimize the exponent calculations, ensuring minimal propagation delay during addition operations. This design aims to achieve a balance between speed, resource utilization, and power consumption by leveraging the strengths of each arithmetic unit. The Wallace Tree Multiplier excels in parallel computation of partial products, making it suitable for large data sets, while the Systolic Multiplier introduces a modular and pipeline-friendly architecture that is ideal for real-time operations. The Kogge-Stone Adder, with its logarithmic delay, is particularly effective in applications requiring high-speed addition, whereas the Carry Bypass Adder minimizes unnecessary carry propagation, further reducing delay in certain scenarios. Together, these components are seamlessly integrated to perform IEEE 754-compliant floating-point multiplication with high accuracy and efficiency.

The project is implemented using industry-standard tools like Cadence Virtuoso and Vivado, allowing for robust circuit-level and RTL-level design, simulation, and verification. Cadence Virtuoso facilitates precise analog and digital design, ensuring that the circuit adheres to stringent timing and power requirements. On the other hand, Vivado enables FPGA implementation, providing flexibility and scalability for deployment in hardware systems. This project not only contributes to the advancement of high-performance computing architectures but also serves as a foundation for further research into optimized arithmetic units tailored for specific applications.

## 4.6. Comparison of various multipliers and adders

| Combination                                                | Leakage<br>power<br>(µW)           | Interna<br>l power<br>(µW)           | Switching<br>power<br>(µW)           | Total<br>power<br>(µW) | Total cell<br>area | Cell<br>count      | Timing s (ns) |
|------------------------------------------------------------|------------------------------------|--------------------------------------|--------------------------------------|------------------------|--------------------|--------------------|---------------|
| Wallace tree<br>multiplier<br>and Carry<br>bypass<br>adder | 33.9<br>μW<br>(3.69%<br>of total)  | 550.7<br>μW<br>(59.74%<br>of total)  | 336.4 µW<br>(36.56%<br>of total)     | 920.0<br>μW            | 9,468.819<br>μm²   | 1,15<br>0<br>cells | 2.2 ns        |
| Wallace tree<br>multiplier<br>and Kogge<br>stone adder     | 33.77                              | 557.57<br>μW<br>(35.64%<br>of total) | 973.66<br>μW<br>(62.15%<br>of total) | 1,535<br>μW            | 9,269.754<br>μm²   | 1,153<br>cells     | 2.5 ns        |
| Systolic<br>multiplier<br>and Carry<br>bypass<br>adder     | 42.62<br>μW<br>(3.06%<br>of total) | 500.48<br>μW<br>(35.96%<br>of total) | 850.15<br>μW<br>(60.98%<br>of total) | 1393<br>.25<br>μW      | 10,753.27<br>8 μm² | 1,142<br>cells     | 3.0 ns        |
| Systolic<br>multiplier<br>and Kogge<br>stone adder         | 56.87<br>μW<br>(6.13%<br>of total) | 374.93<br>μW<br>(40.38%<br>of total) | 497.82<br>μW<br>(53.49%<br>of total) | 928.<br>72<br>μW       | 12,929.36<br>5 μm² | 1,252<br>cells     | 2.8 ns        |

Table 4.1 Comparison of various multipliers and adders

## Power Analysis

### -Leakage Power:

The leakage power is generally a small contributor across all combinations. The Wallace Tree Multiplier with Kogge-Stone Adder has the lowest leakage power at 33.77  $\mu$ W, which is just 2.20% of the total power. On the other hand, the **Systolic Multiplier with Kogge-Stone Adder** has the highest leakage power at 56.87  $\mu$ W, which accounts for 6.13% of the total power. This difference in leakage is due to the dense interconnections and the complex prefix-tree structure of the Kogge-Stone adder, which results in increased transistor activity and leakage currents.

#### -Internal Power:

In terms of internal power, the Wallace Tree Multiplier with Kogge-Stone Adder consumes the highest amount, with 557.57  $\mu$ W, making up 35.64% of the total power. This is because of the dense partial product reduction structure of the Wallace tree, which requires significant internal charging and discharging of capacitive

nodes. Conversely, the **Systolic Multiplier with Kogge-Stone Adder** consumes the least internal power at 374.93  $\mu$ W (40.38% of total power). The regular structure of the systolic array leads to more efficient internal operations compared to the more complex Wallace tree structure.

## -Switching Power:

Switching power is the largest contributor to the total power consumption in all combinations. The **Wallace Tree Multiplier with Kogge-Stone Adder** leads with the highest switching power of 973.66  $\mu$ W, which accounts for 62.15% of the total power. This is due to the parallel structure of the Wallace Tree multiplier, which results in higher signal toggling and more frequent switching events. In contrast, the **Systolic Multiplier with Kogge-Stone Adder** has the lowest switching power at 497.82  $\mu$ W (53.49% of total power), reflecting the more controlled, sequential operation of the systolic array and its optimized signal transitions.

#### -Total Power:

In terms of total power consumption, the **Systolic Multiplier with Kogge-Stone Adder** is the most power-efficient, consuming only 928.72  $\mu$ W. This low power consumption is due to the balanced distribution of internal and switching power in this combination. On the other hand, the **Wallace Tree Multiplier with Kogge-Stone Adder** consumes the highest total power at 1,535  $\mu$ W, primarily due to the significant switching power (62.15%) associated with the Wallace Tree's parallelism and high-frequency switching behavior.

## • Area Analysis

#### -Cell Area:

The Wallace Tree Multiplier with Kogge-Stone Adder occupies the smallest area of 9,269.754  $\mu$ m<sup>2</sup>, making it the most compact design in terms of cell area. This compactness is due to the optimized partial product reduction of the Wallace Tree and the relatively simpler structure of the Kogge-Stone Adder. In contrast, the Systolic Multiplier with Kogge-Stone Adder has the largest area at 12,929.365  $\mu$ m<sup>2</sup>. The increased area in this combination is due to the regular array structure of the systolic multiplier, which requires a higher number of cells and interconnections.

### -Cell Count:

The **Systolic Multiplier with Carry Bypass Adder** has the lowest cell count of 1,142 cells, reflecting its more streamlined design compared to other combinations. The **Systolic Multiplier with Kogge-Stone Adder**, however, has the highest cell count at 1,252 cells, reflecting the added complexity of the Kogge-Stone Adder's prefix tree structure, which increases the number of logic gates required. This results in a higher number of cells and more complex routing.

### • Timing Performance

#### -Fastest Combination:

The Wallace Tree Multiplier with Carry Bypass Adder achieves the fastest timing at 2.2 ns, making it the best choice for applications where speed is critical.

The Wallace Tree's efficient parallelism in reducing partial products, combined with the fast Carry Bypass Adder, enables it to achieve this low timing.

#### -Slowest Combination:

The **Systolic Multiplier with Carry Bypass Adder** has the slowest timing at 3.0 ns, primarily due to the sequential nature of the systolic array. While systolic multipliers provide regularity and parallelism in their computation, they generally result in longer critical paths compared to Wallace Tree designs, making them slower in terms of overall operation.

#### Overall Insights

## -Wallace Tree Multiplier with Carry Bypass Adder:

This combination is the fastest, with a timing of 2.2 ns, and has moderate total power consumption of 920  $\mu$ W, which is distributed across internal (59.74%) and switching power (36.56%). Its compact area of 9,468.819  $\mu$ m² and 1,150 cells make it efficient in terms of resource utilization. However, there is potential for optimization in reducing internal power, as it makes up a significant portion of the total power.

## -Wallace Tree Multiplier with Kogge-Stone Adder:

This combination consumes the highest total power  $(1,535 \mu W)$  due to significant switching power (62.15%) but has the smallest area  $(9,269.754 \mu m^2)$  and the lowest leakage power (2.20%). The high switching activity is characteristic of the Wallace Tree design, which benefits from parallelism but incurs higher power costs in terms of signal toggling.

## -Systolic Multiplier with Carry Bypass Adder:

This combination is the most power-efficient, consuming only 928.72  $\mu$ W, with well-balanced internal and switching power consumption. It uses 1,142 cells and occupies a moderate area of 10,753.278  $\mu$ m<sup>2</sup>. While it is more efficient in power consumption, it has the slowest timing (3.0 ns), making it less suitable for applications where speed is critical.

## -Systolic Multiplier with Kogge-Stone Adder:

The **Systolic Multiplier with Kogge-Stone Adder** is the most balanced combination in terms of power efficiency, with a total power of 928.72  $\mu$ W. While it has the largest area (12,929.365  $\mu$ m<sup>2</sup>) and the highest cell count (1,252), it still maintains a reasonable timing of 2.8 ns, which is slower than the Wallace Tree-based designs but still competitive.

#### • Recommendation:

- -For Speed-Critical Applications: The Wallace Tree Multiplier with Carry Bypass Adder offers the fastest performance (2.2 ns) with a compact area and moderate power consumption.
- -For Power Efficiency: The Systolic Multiplier with Kogge-Stone Adder is the most power-efficient combination, consuming only 928.72  $\mu$ W, making it suitable for power-constrained designs.
- -For Area-Constrained Designs: The Wallace Tree Multiplier with Kogge-Stone Adder is the most area-efficient combination, with the smallest cell area  $(9,269.754 \ \mu m^2)$ , making it ideal for resource-constrained applications.

## **CHAPTER 5**

## **CONCLUSION & FUTURE SCOPE**

The design and implementation of a floating-point multiplier using high-performance adders such as the Kogge-Stone Adder and Carry-Bypass Adder, along with multipliers like the Wallace Tree Multiplier and Systolic Multiplier, demonstrate a significant leap in the development of efficient and high-speed computational circuits. By utilizing tools like Cadence Virtuoso for custom ASIC designs and Vivado for FPGA-based prototyping, this approach achieves notable improvements in speed, power efficiency, and area optimization. The Kogge-Stone Adder and Wallace Tree Multiplier reduce critical path delays, making the design highly suitable for applications requiring rapid execution of floating-point operations, while the Carry-Bypass Adder and Systolic Multiplier enhance power efficiency, making the solution ideal for energy-constrained applications such as portable devices and edge computing. This versatile design caters to a wide range of applications, including artificial intelligence, machine learning, real-time processing, and scientific simulations, where precision and performance are crucial. Furthermore, the tools employed provide scalability and flexibility, enabling iterative design, testing, and optimization for both FPGA and ASIC platforms. This innovative combination of components and methodologies ensures the relevance of this design in addressing modern computational challenges and sets a strong foundation for future advancements in high-performance digital arithmetic.

## **5.1 Future Scope**

### 5.1.1. Scope in High-Performance Computing

- **-Enhanced Speed and Precision**: Floating-point operations are critical in scientific computing, and real-time data analytics. Combining Kogge-Stone and Wallace Tree techniques provides superior speed and precision.
- **-Cloud Computing and Data Centers**: High-performance floating-point multipliers can be utilized in data centers for improved computational throughput.

## 5.1.2. Applications in Machine Learning and AI

- **-Neural Networks and Deep Learning**: Training and inference phases in neural networks require high-precision multipliers. Optimized designs using Vivado (for FPGA) or Virtuoso (for ASIC) can reduce latency and power consumption.
- **-Edge Computing**: Low-power and high-speed floating-point multipliers are essential for deploying AI algorithms in edge devices.

## 5.1.3. VLSI Circuit Optimization

- **-Reduced Power and Area**: Implementing Kogge-Stone and Carry-Bypass Adders in Virtuoso enables custom transistor-level optimization for lower power and area.
- **-Process Node Scaling**: With advanced process nodes (e.g., 5nm or 3nm), Virtuoso can be used to simulate and optimize designs for cutting-edge semiconductor technologies.

### **5.1.4. Real-Time Applications**

- **-Signal and Image Processing**: Floating-point multipliers are vital for FFTs, digital filters, and video codecs, where precision and speed are crucial.
- -Autonomous Vehicles: High-performance multipliers aid in real-time decision-making systems for path planning, object detection, and sensor fusion.

### 5.1.5. FPGA Prototyping and Deployment

**-Accelerators**: Vivado can be used to design and prototype accelerators with floating-point multipliers, allowing deployment in specialized hardware such as GPUs and FPGAs.

### 5.1.6. Research and Development

- -Algorithm Exploration: Exploring novel configurations or hybrid designs (e.g., Kogge-Stone Adders with Systolic Multipliers) for better trade-offs between speed, power, and area.
- **-Emerging Domains**: Quantum computing, bioinformatics, and cryptographic algorithms may benefit from optimized floating-point multipliers.

## 5.1.7. Integration into AI Chips and SoCs

- **-AI and GPU Processors**: Floating-point multipliers are a fundamental block in AI accelerators and GPUs. High-performance implementations can significantly enhance processing capabilities.
- **-Custom ASICs**: Virtuoso allows detailed customization at the transistor level for application-specific chips.

## **5.2** Implementation Challenges and Considerations

- **-Tool Proficiency**: Mastery in Cadence Virtuoso for ASIC designs and Vivado for FPGA-based implementations is critical.
- **-Verification and Testing**: Rigorous testing of the floating-point arithmetic against IEEE-754 standards.
- **-Trade-Off Management**: Balancing speed, power, area, and design complexity is essential.

# **REFERENCES**

- [1]. K. K. Senthil Kumar, S. Yuvaraj, and R. Seshasayanan, "High-Performance Wallace Tree Multiplier," International Journal of Computer Techniques, vol. 7, no. 1, pp. 1-8, Feb. 2020.
- [2]. R. P. Ramanathan, P. Kowsalya, and P. Anitha, "Modified Low Power Wallace Tree Multiplier using Higher Order Compressors," International Journal of Electronics Letters, vol. 5, no. 2, pp. 177-188, 2017.
- [3]. K. V. B. V. Rayudu, D. R. Jahagirdar, and P. S. Rao, "Design and Testing of Systolic Array Multiplier Using Fault Injecting Schemes," Computer Science and Information Technologies, vol. 3, no. 1, pp. 1-9, Mar. 2022, Doi: 10.11591/csit.v3i1.pp1-9.
- [4]. D. Zhang, L. Miller, "High-Speed Floating-Point Multipliers: A Review," IEEE Transactions on Computers, vol. 70, no. 8, pp. 1200-1215, Aug. 2021.
- [5]. N. Patel, R. Green, "Advanced Techniques for Kogge Stone Adders," Journal of VLSI Design and Test, vol. 18, no. 2, pp. 90-100, Jun. 2023.
- [6]. A. Kumar, V. Singh, "Efficient Carry Bypass Adder Architectures," IEEE Journal of Solid-State Circuits, vol. 55, no. 12, pp. 3000-3010, Dec. 2020.
- [7]. D. H. Tabassum, K. S. Rao, "Design of double precision floating point multiplier using Vedic multiplication", International Journal of Electrical and Electronics Research, vol. 3, Issue 3, pp (162-169), July-September 2015.
- [8]. Havaldar, Soumya, and K S Gurumurthy, "Design of Vedic IEEE 754 floating-point multiplier," IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT). 20-21 May, pp.1131-1135, New York: IEEE, (2016) DOI: 10.1109/RTEICT.2016.7808008.
- [9]. S. Havaldar and K. S. Gurumurthy, "Design of Vedic IEEE 754 floating point multiplier," 2016 IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), Bangalore, India, 2016, pp. 1131-1135, Doi: 10.1109/RTEICT.2016.7808008.

- [10]. M. K. Jaiswal and H. K. -H. So, "Dual-mode double precision / two-parallel single precision floating point multiplier architecture," 2015 IFIP/IEEE International Conference on Very Large-Scale Integration (VLSI-SoC), Daejeon, Korea (South), 2015, pp. 213-218, Doi: 10.1109/VLSI-SoC.2015.7314418.
- [11]. Sunesh N.V and Sathishkumar P, "Design and implementation of fast floating point multiplier unit," 2015 International Conference on VLSI Systems, Architecture, Technology and Applications (VLSI-SATA), Bengaluru, India, 2015, pp. 1-5, Doi: 10.1109/VLSI-SATA.2015.7050478.
- [12]. A. P. Ramesh, A. V. N. Tilak and A. M. Prasad, "An FPGA based high speed IEEE-754 double precision floating point multiplier using Verilog," 2013 International Conference on Emerging Trends in VLSI, Embedded System, Nano Electronics and Telecommunication System (ICEVENT), Tiruvannamalai, India, 2013, pp. 1-5, Doi: 10.1109/ICEVENT.2013.6496575.
- [13]. A. Sundhar, S. D. Tharshini, G. Priyanka, S. Ragul and C. Saranya, "Performance Analysis of Wallace Tree Multiplier with Kogge Stone Adder using 15-4 Compressor," 2019 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 2019, pp. 0903-0907, Doi: 10.1109/ICCSP.2019.8697981.
- [14]. S. Jayarajkumar and K. Sivanandam, "Design and implementation of 16-bit systolic multiplier using modular shifting algorithm," 2016 Second International Conference on Science Technology Engineering and Management (ICONSTEM), Chennai, India, 2016, pp. 532-537, Doi: 10.1109/ICONSTEM.2016.7560950.

## **APPENDICES**

### **VERILOG CODES**

i. Verilog code for Wallace tree multiplier and Kogge stone adder combination.

```
module FloatingPointMultiplier (
                                   input [31:0] A,
                                                       // First
operand (32-bit single precision)
                                   input [31:0] B,
                                                      // Second
operand (32-bit single precision)
                                   output [31:0] result // 32-bit
floating point result
);
  // Extract the sign, exponent, and mantissa from A and B
wire signA = A[31]; wire signB = B[31];
                                             wire [7:0]
\exp A = A[30:23]; wire [7:0] \exp B = B[30:23];
[22:0] mantA = A[22:0]; wire [22:0] mantB = B[22:0];
  // Calculate the sign of the result
wire resultSign = signA ^ signB;
  // Add the exponents (including the bias) wire [8:0] \exp Sum = \exp A + \exp B - 8'h7F;
// IEEE 754 bias for single precision is 127
  // Normalize the mantissas (adding the implicit leading 1 for normalized numbers)
wire [24:0] mantA_norm = \{1'b1, mantA\};
                                             wire [24:0] mantB_norm = \{1'b1,
mantB};
  // Mantissa multiplication (24-bit precision)
wire [49:0] mantResult;
                          assign mantResult =
mantA_norm * mantB_norm;
  // Kogge-Stone Adder to add partial sums of mantissa multiplication
  wire [49:0] sum;
  KoggeStoneAdder KSA (
     .A(mantResult[49:25]),
```

```
.B(mantResult[24:0]),
     .sum(sum)
  );
  // Normalize the product mantissa (if necessary)
                                                      wire
[23:0] resultMantissa;
                         wire [7:0] resultExponent;
                                                      wire
shift = sum[49]; // Check for any left shifts needed
assign resultMantissa = shift ? sum[48:25] : sum[47:24];
assign resultExponent = expSum + shift;
  // Combine the result sign, exponent, and mantissa
                                                         assign result
= {resultSign, resultExponent, resultMantissa[22:0]};
endmodule
module KoggeStoneAdder (
  input [24:0] A, // First input (partial sum)
input [24:0] B, // Second input (partial sum)
output [49:0] sum // Final sum
);
  wire [49:0] carry; // Carry bits
                                    wire [49:0] P,
G; // Propagate and Generate bits
                                     wire [49:0]
C:
      // Carry propagate
  // Propagate and generate terms
                                     assign P = A^{\wedge}
B; // Propagate: XOR of A and B
                                     assign G = A
& B; // Generate: AND of A and B
  // First stage of Kogge-Stone Adder
                                         assign carry[0] = 0; //
No carry-in for the least significant bit assign C[0] = 0;
Initial carry is 0
```

```
\label{eq:carry} \begin{tabular}{ll} // Carry and sum calculations & genvar i; & generate \\ for $(i=1; i < 50; i=i+1)$ begin : $carry_propagate $$ assign $carry[i] = G[i-1] \mid (P[i-1] \& carry[i-1]); $$ assign $sum[i] = P[i] \land carry[i]; $$ end endgenerate $$ endmodule $$ \end{tabular}
```

ii. Verilog code for Wallace tree multiplier and Carry bypass adder combination.

```
module floating point multiplier(
                                    input [31:0] A,
// Input A (32-bit single precision)
                                    input [31:0] B,
// Input B (32-bit single precision)
                                    output [31:0]
product // 32-bit product output
);
  // Extract the sign, exponent, and mantissa for both inputs
wire sign_A, sign_B, sign_product;
                                    wire [7:0] exp A,
exp_B, exp_product; wire [23:0] mant_A, mant_B,
mant_product;
  // Sign bit (1 bit)
assign sign_A = A[31];
assign sign_B = B[31];
  // Exponent (8 bits) and subtract the bias (127 for single precision)
assign exp A = A[30:23]; assign exp B = B[30:23];
  // Mantissa (23 bits for normalized mantissa, with implicit leading 1)
assign mant_A = \{1'b1, A[22:0]\};
  assign mant_B = \{1'b1, B[22:0]\};
  // Multiply mantissas (24 bits * 24 bits = 48 bits result)
wire [47:0] mantissa product;
mantissa_product = mant_A * mant_B;
```

```
// Add exponents (with bias of 127)
                                       wire [8:0]
raw_exp_product = exp_A + exp_B - 8'd127;
exp_product = raw_exp_product[7:0];
  // Determine the sign of the product (XOR of the sign bits) assign
sign product = sign A ^ sign B;
  // Check if the mantissa product requires normalization wire normalize flag; assign
normalize_flag = mantissa_product[47]; // Check the MSB of mantissa for
normalization
  // Mantissa normalization logic
wire [47:0] mantissa_result;
[7:0] adjusted_exp_product;
  // If normalization is needed, shift the mantissa and adjust the exponent
  assign mantissa result
                                      normalize flag
                              =
         mantissa_product[46:24]
                                      : mantissa_product[45:23];
  assign adjusted_exp_product = normalize_flag ? exp_product + 1 : exp_product;
  // Final product (sign + exponent + mantissa) assign product = {sign_product,
adjusted_exp_product, mantissa_result[22:0]}; // 1 bit for sign, 8 bits for exponent, 23
bits for mantissa
    endmodule
```

## iii. Verilog code for Systolic multiplier and Carry bypass adder combination.

```
module fp_multiplier( input [31:0] a, // 32-bit IEEE 754 floating-point input a input [31:0] b, // 32-bit IEEE 754 floating-point input b output [31:0] result // 32-bit IEEE 754 floating-point result );
```

```
// Step 1: Extract sign, exponent, and mantissa components of both inputs wire
sign_a = a[31]; wire sign_b = b[31]; wire [7:0] exponent_a = a[30:23]; wire [7:0]
exponent_b = b[30:23]; wire [23:0] mantissa_a = {1'b1, a[22:0]}; // Append implied
                         wire [23:0] mantissa b = \{1'b1, b[22:0]\};
leading 1 for IEEE 754
  // Step 2: Multiply the mantissas using a systolic array multiplier
                             systolic multiplier
wire [47:0] mantissa mult;
systolic_mult_inst (
     .a(mantissa_a),
     .b(mantissa_b),
     .product(mantissa_mult)
  );
  // Step 3: Add exponents using the Kogge-Stone adder and adjust for bias (127 in
    IEEE 754)
                 wire [8:0]
exponent_sum_raw;
kogge stone adder adder inst (
     .a(exponent_a),
     .b(exponent_b),
     .cin(8'd127), // Adjust for bias
     .sum(exponent_sum_raw)
  );
  // Step 4: Assemble final result, apply normalization, and handle edge cases
                                                                                  wire
[22:0] normalized_mantissa = mantissa_mult[46:24]; // Take 23 most significant bits
  wire [7:0] final_exponent = exponent_sum_raw[7:0] + mantissa_mult[47]; // Adjust
    exponent if overflow in mantissa
  // Handle signs and assemble IEEE 754 format result
                                                         assign result =
{sign a ^ sign b, final exponent, normalized mantissa}; endmodule
// Systolic multiplier for 24-bit mantissa multiplication module
systolic_multiplier (
                      input [23:0] a, // 24-bit mantissa a
```

```
[23:0] b, // 24-bit mantissa b
                                output reg[47:0] product // 48-bit
product of mantissa multiplication
);
  // Simple systolic multiplier logic for 24x24 multiplication
reg [47:0] partial_product [23:0];
                                     integer i;
  always @(*)begin
     // Initialize partial products
                                       for (i = 0; i <
24; i = i + 1) begin
                              partial_product[i] =
a[i]? (b << i): 48'd0;
                              end
     // Sum the partial products
                                      product
         for (i = 0; i < 24; i = i + 1) begin
product = product + partial_product[i];
end
      end
endmodule
// Carry bypass Adder for 8-bit exponent addition with carry input
module carry_bypass_adder (
  input [7:0] a, // 8-bit input a input [7:0] b, // 8-
             input [7:0] cin, // Bias of 127 in IEEE
bit input b
              output [8:0] sum // 9-bit sum to
754 format
handle overflow
                        // Propagate and generate
); wire [7:0] p, g;
wire [7:0] c;
                    // Carry
  // Stage 0: Initial propagate and generate
assign p = a \wedge b;
                    // Propagate
                                    assign g
= a \& b;
            // Generate
  // Carry bypass stages for carry computation
assign c[0] = g[0] | (p[0] \& cin[0]);
                                       assign
c[1] = g[1] | (p[1] \& c[0]);
                              assign c[2] =
```

```
\begin{split} &g[2] \mid (p[2] \& c[1]); \quad assign \ c[3] = g[3] \mid \\ &(p[3] \& c[2]); \quad assign \ c[4] = g[4] \mid (p[4] \& c[3]); \quad assign \ c[5] = g[5] \mid (p[5] \& c[4]); \\ &assign \ c[6] = g[6] \mid (p[6] \& c[5]); \quad assign \\ &c[7] = g[7] \mid (p[7] \& c[6]); \\ &\text{// Final Sum computation} \quad assign \\ &sum = \{1'b0, p\} \land \{c, cin[7]\}; \\ &endmodule \end{split}
```

### iv. Verilog code for Systolic multiplier and Kogge stone adder combination.

```
module floating point multiplier(
                                      input [31:0]
a, // First input (floating point)
                                  input [31:0] b,
// Second input (floating point)
                                  output [31:0]
product // Output (floating point)
);
  // Extract sign, exponent, and mantissa
                                             wire
sign_a = a[31]; // Sign bit of a wire sign_b =
b[31]; // Sign bit of b
                        wire [7:0] exponent_a =
a[30:23]; // Exponent of a
                             wire [7:0] exponent_b =
b[30:23]; // Exponent of b
                             wire [22:0] mantissa_a
= a[22:0]; // Mantissa of a
                             wire [22:0] mantissa_b
= b[22:0]; // Mantissa of b
  // Compute sign of the product
                                      wire sign product = sign a \wedge sign b; //
Product sign is XOR of input signs
  // Compute exponent
                             wire [8:0] exponent_sum = \{1'b0, exponent_a\} + \{1'b0, exponent_a\}
exponent_b} - 127; // Add exponents and subtract bias
  wire [7:0] exponent_product = exponent_sum[7:0]; // 8-bit exponent for product
```

```
// Normalize mantissas by adding implicit leading 1
normalized_mantissa_a = {1'b1, mantissa_a}; // 24 bits for mantissa of a
                                                                          wire [23:0]
normalized_mantissa_b = {1'b1, mantissa_b}; // 24 bits for mantissa of b
  // Systolic Multiplier (Direct Binary Multiplication)
                                                       wire [47:0] mantissa product
= normalized mantissa a * normalized mantissa b; // 24 \times 24 = 48 bits
  // Kogge-Stone Adder for normalization wire [23:0]
final mantissa; // Final normalized mantissa
kogge_stone_adder KSA (
     .a(mantissa_product[46:23]), // Top half of the product (23 bits)
     .b(mantissa_product[22:0]),
                                      // Bottom half of the product (23 bits)
.sum(final_mantissa)
                          // Output sum
  );
  // Adjust for normalization (if mantissa overflowed)
                                                        wire [7:0] adjusted exponent
= exponent product + (final mantissa[23]? 1:0); // Increment exponent if needed
  // Assemble final product
  assign product = {sign product, adjusted exponent, final mantissa[22:0]}; //
    Combine sign, exponent, and mantissa
endmodule
module kogge_stone_adder(
  input [23:0] a, // First input to the adder
input [23:0] b, // Second input to the adder
output [23:0] sum // Sum output
);
  // Kogge-Stone adder logic for 24-bit addition
wire [23:0] G; // Generate signals wire [23:0]
P; // Propagate signals wire [24:0] C; // Carry
         wire [24:0] S; // Sum output
signals
```

```
// Generate and propagate signals
assign G = a & b; // Generate
P = a \wedge b; // Propagate
  // Carry lookahead logic
  assign C[0] = 1'b0; // Carry in for the first stage is zero
  // Carry calculation using Kogge-Stone logic
genvar i;
            generate
    for (i = 0; i < 24; i = i + 1) begin: carry_generation
assign C[i + 1] = G[i] | (P[i] \& C[i]); // Carry out is generated
      endgenerate
end
  // Sum calculation
  assign S = P \land C[24:1]; // Final sum is P XOR with carry
  // Output the lower 24 bits of the sum
assign sum = S[23:0];
    endmodule
```

## **CERTIFICATES**

The certificates issued by IJSREM recognize the successful completion of our project titled "Floating Point Multiplier Using High-Performance Adders and Multipliers." These certificates serve as an acknowledgment of the project's significant contribution to advancements in computational design and efficiency. This achievement underscores our dedication to innovation and may pave the way for further research opportunities and collaborations











### PERSONAL DETAILS



#### **Assistant Professor Guide:**

Name: Mr.Chethan B R received B.E. degree in Electronics and Communication Engineering in M.C. Hassan Collage and MTech from Bangalore.

Email id: chethan.br@pestrust.edu.in

Contact No: 9964874684



Name: T S Samarth USN:4PM21EC098

Email id: tssam1547@gmail.com

Mobile No:7676143660



Name: Praveen T R USN:4PM21EC074

Email id: praveentr961@gmail.com

Mobile No:8073851774



Name: Vishwas V S USN:4PM21EC103

Email id: vishwasachar7784@gmail.com

Mobile No: 7975066328



Name: Mruthyunjaya S D USN:4PM21EC060

Email id: mruthyunjayasd9535@gmail.com

Mobile No: 8088461545